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Abstract. Technological innovations have revolutionized the process of scientific research 
and knowledge discovery. The availability of massive data and challenges from frontiers 
of research and development have reshaped statistical thinking, data analysis and the- 
oretical studies. The challenges of high-dimensionality arise in diverse fields of sciences 
and the humanities, ranging from computational biology and health studies to financial 
engineering and risk management. In all of these fields, variable selection and feature 
extraction are crucial for knowledge discovery. We first give a comprehensive overview 
of statistical challenges with high dimensionality in these diverse disciplines. We then 
approach the problem of variable selection and feature extraction using a unified frame- 
work: penalized likelihood methods. Issues relevant to the choice of penalty functions are 
addressed. We demonstrate that for a host of statistical problems, as long as the dimen- 
sionality is not excessively large, we can estimate the model parameters as well as if the 
best model is known in advance. The persistence property in risk minimization is also 
addressed. The applicability of such a theory and method to diverse statistical problems 
is demonstrated. Other related problems with high-dimensionality are also discussed. 
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1. Introduction 

Technological innovations have had deep impact on society and on scientific re- 
search. They allow us to collect massive amount of data with relatively low cost. 
Observations with curves, images or movies, along with many other variables, are 
frequently seen in contemporary scientific research and technological development. 
For example, in biomedical studies, huge numbers of magnetic resonance images 
(MRI) and functional MRI data are collected for each subject with hundreds of 
subjects involved. Satellite imagery has been used in natural resource discovery 
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and agriculture, collecting thousands of high resolution images. Examples of these 
kinds are plentiful in computational biology, climatology, geology, neurology, health 
science, economics, and finance among others. Frontiers of science, engineering and 
the humanities differ in the problems of their concerns, but nevertheless share one 
common theme: massive and high-throughput data have been collected and new 
knowledge needs to be discovered using these data. These massive collections of 
data along with many new scientific problems create golden opportunities and 
significant challenges for the development of mathematical sciences. 

The availability of massive data along with new scientific problems have re- 
shaped statistical thinking and data analysis. Dimensionality reduction and fea- 
ture extraction play pivotal roles in all high-dimensional mathematical problems. 
The intensive computation inherent in these problems has altered the course of 
methodological development. At the same time, high-dimensionality has signif- 
icantly challenged traditional statistical theory. Many new insights need to be 
unveiled and many new phenomena need to be discovered. There is little doubt 
that the high dimensional data analysis will be the most important research topic 
in statistics in the 21st century 

Variable selection and feature extraction are fundamental to knowledge discov- 
ery from massive data. Many variable selection criteria have been proposed in the 
literature. Parsimonious models are always desirable as they provide simple and 
interpretable relations among scientific variables in addition to reducing forecasting 
errors. Traditional variable selection such as Cp, AIC and BIC involves a combina- 
torial optimization problem, which is NP-hard, with computational time increasing 
exponentially with the dimensionality. The expensive computational cost makes 
traditional procedures infeasible for high-dimensional data analysis. Clearly, inno- 
vative variable selection procedures are needed to cope with high-dimensionality. 

Computational challenges from high-dimensional statistical endeavors forge 
cross-fertilizations among applied and computational mathematics, machine learn- 
ing, and statistics. For example, Donoho and Elad [201 show that the NP-hard best 
subset regression can be solved by a penalized Li least-squares problem, which 
can be handled by a linear programming, when the solution is sufficiently sparse. 
Wavelets are widely used in statistics function estimation and signal processing 
PiniinillSEniESllMlEni- Algebraic statistics, the term coined by Pistone, 
Riccomagno, Wynn |72|, uses polynomial algebra and combinatorial algorithms 
to solve computational problems in experimental design and discrete probability 
|72| . conditional inferences based on Markovian chains [Ifij . parametric inference 
for biological sequence analysis |71| . and phylogenetic tree reconstruction |77j . 

In high-dimensional data mining, it is helpful to distinguish two types of statis- 
tical endeavors. In many machine learning problems such as tumor classifications 
based on microarray or proteomics data and asset allocations in finance, the inter- 
ests often center around the classification errors, or returns and risks of selected 
portfolios rather than the accuracy of estimated parameters. On the other hand, 
in many other statistical problems, concise relationship among dependent and in- 
dependent variables are needed. For example, in health studies, we need not only 
to identify risk factors, but also to assess accurately their risk contributions. These 
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are needed for prognosis and understanding the relative importance of risk factors. 
Consistency results are inadequate for assessing the uncertainty in parameter esti- 
mation. The distributions of selected and estimated parameters are needed. Yet, 
despite extensive studies in classical model selection techniques, no satisfactory 
solutions have yet been produced. 

In this article, we address the issues of variable selection and feature extraction 
using a unified framework: penalized likelihood methods. This framework is ap- 
plicable to both machine learning and statistical inference problems. In addition, 
it is applied to both exact and approximate statistical modeling. We outline, in 
Section 2, some high-dimensional problems from computational biology, biomedi- 
cal studies, financial engineering, and machine learning, and then provide a unified 
framework to address the issues of feature selection in Sections 3 and 4. In Sec- 
tions 5 and 6, the framework is then applied to provide solutions to some problems 
outlined in Section 2. 



2. Challenges from sciences and humanities 

We now outline a few problems from various frontiers of research to illustrate 
the challenges of high-dimensionality. Some solutions to these problems will be 
provided in Section 6. 

2.1. Computational biology. Bioinformatic tools have been widely ap- 
plied to genomics, protcomics, gene networks, structure prediction, disease diag- 
nosis and drug design. The breakthroughs in biomedical imaging technology allow 
scientists to monitor large amounts of diverse information on genetic variation, 
gene and protein functions, interactions in regulatory processes and biochemical 
pathways. Such technology has also been widely used for studying neuron ac- 
tivities and networks. Genomic sequence analysis permits us to understand the 
homologies among diff'erent species and infer their biological structures and func- 
tionalities. Analysis of the network structure of protein can predict the protein 
biological function. These quantitative biological problems raise many new sta- 
tistical and computational problems. Let us focus specifically on the analysis of 
microarray data to illustrate some challenges with dimensionality. 

DNA microarrays have been widely used in simultaneously monitoring mRNA 
expressions of thousands of genes in many areas of biomedical research. There are 
two popularly-used techniques: c-DNA microarrays ;5 and Affymetrix GeneChip 
arrays 60 . The former measures the abundance of mRNA expressions by mix- 
ing mRNAs of treatment and control cells or tissues, hybridizing with cDNA on 
the chip. The latter uses combined intensity information from 11-20 probes in- 
terrogating a part of the DNA sequence of a gene, measuring separately mRNA 
expressions of treatment and control cells or tissues. Let us focus further on the 
cDNA microarray data. 

The first statistical challenge is to remove systematic biases due to experiment 
variations such as intensity effect in the scanning process, block effect, dye effect. 
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batch effect, amount of mRNA, DNA concentration on arrays, among others. This 
is collectively referred to as normalization in the literature. Normalization is criti- 
cal for multiple array comparisons. Statistical models are needed for estimation of 
these systematic biases in presence of high-dimensional nuisance parameters from 
treatment effects on genes. See, for example, lowess normalization in |25','82|, semi- 
parametric model-based normalization by |35II3K[I^ . and robust normalization in 
|62|. The number of significantly expressed genes is relatively small. Hence, model 
selection techniques can be used to exploit the sparsity. In Section 6.1, we briefly 
introduce semiparametric modeling techniques to issues of normalization of cDNA 
microarray. 

Once systematic biases have been removed, the statistical challenge becomes 
selecting statistically significant genes based on a relatively small sample size of 
arrays (e.g. n = 4, 6, 8). Various testing procedures have been proposed in the lit- 
erature. See, for example, |29l Kffil 1^ 1821 l8Hj . In carrying out simultaneous testing 
of orders of hundreds or thousands of genes, classical methods of controlling the 
probability of making one falsely discovered gene are no longer relevant. Therefore 
various innovative methods have been proposed to control the false discovery rates. 
See, for example, [^ I21l f24 26 , 43, 56, 76 . The fundamental assumption in these 
developments is that the null distribution of test statistics can be determined ac- 
curately. This assumption is usually not granted in practice and new probabilistic 
challenge is to answer the questions how many simultaneous hypotheses can be 
tested before the accuracy of approximations of null distributions becomes poor. 
Large deviation theory 03 IHl] is expected to play a critical role in this en- 
deavor. Some progress has been made using maximal inequalities |54j . 

Tumor classification and clustering based on microarray and proteomics data 
are another important class of challenging problems in computational biology. 
Here, hundreds or thousands of gene expressions are potential predictors, and 
the challenge is to select important genes for effective disease classification and 
clustering. See, for example, |78l 1811 IH7j for an overview and references therein. 

Similar problems include time-course microarray experiments used to deter- 
mine the expression pathways over time |78[ 17^ and genetic networks used for 
understanding interactions in regulatory processes and biochemical pathways |57| . 
Challenges of selecting significant genes over time and classifying patterns of gene 
expressions remain. In addition, understanding genetic network problems requires 
estimating a huge covariance matrix with some sparsity structure. We introduce 
a modified Cholesky decomposition technique for estimating large scale covariance 
matrices in Section 6.1. 



2.2. Health studies. Many health studies are longitudinal: each subject is 
followed over a period of time and many covariates and responses of each subject 
are collected at different time points. Framingham Heart Study (FHS), initiated 
in 1948, is one of the most famous classic longitudinal studies. Documentation of 
its first 50 years can be found at the website of National Heart, Lung and Blood 
Institute | http://www.nhlbi.nih.gov/about/framingham/ 1. One can learn more 
details about this study from the website of American Heart Association. In brief, 
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the FHS follows a representative sample of 5,209 adult residents and their offspring 
aged 28-62 years in Framingham, Massachusetts. These subjects have been tracked 
using (a) standardized biennial cardiovascular examination, (b) daily surveillance 
of hospital admissions, (c) death information and (d) information from physicians 
and other sources outside the clinic. 

In 1971 the study enrolled a second-generation group to participate in similar 
examinations. It consisted of 5,124 of the original participants' adult children and 
their spouses. This second study is called the Framingham Offspring Study. 

The main goal of this study is to identify major risk factors associated with 
heart disease, stroke and other diseases, and to learn the circumstances under 
which cardiovascular diseases arise, evolve and end fatally in the general popula- 
tion. The findings in this studies created a revolution in preventive medicine, and 
forever changed the way the medical community and general public view on the 
genesis of disease. In this study, there are more than 25,000 samples, each consist- 
ing of more than 100 variables. Because of the nature of this longitudinal study, 
some participant cannot be followed up due to their migrations. Thus, the col- 
lected data contain many missing values. During the study, cardiovascular diseases 
may develop for some participants, while other participants may never experience 
with cardivoscular diseases. This implies that some data are censored because the 
event of particular interest never occurred. Furthermore, data between individuals 
may not be independent because data for individuals in a family are clustered and 
likely positively correlated. Missing, censoring and clustering are common features 
in health studies. These three issues make data structure complicated and iden- 
tification of important risk factors more challenging. In Section 6.2, we present a 
penalized partial likelihood approach to selecting significant risk factors for cen- 
sored and clustering data. The penalized likelihood approach has been used to 
analyze a data subset of Frammingham study in 9 . 

High-dimensionality is frequently seen in many other biomedical studies. For 
example, ecological momentary assessment data have been collected for smoking 
cessation studies. In such a study, each of a few hundreds participants is provided 
a hand-held computer, which is designed to randomly prompt the participants five 
to eight times per day over a period of about 50 days and to provide 50 questions at 
each prompt. Therefore, the data consist of a few hundreds of subjects and each of 
them may have more than ten thousand observed values [IHI ■ Such data are termed 
intensive longitudinal data. Classical longitudinal methods are inadequate for such 
data. Walls and Schafer jSQ presents more examples of intensive longitudinal data 
and some useful models to analyze this kind of data. 

2.3. Financial engineering and risk management. Technological 
revolution and trade globalization have introduced a new era of financial markets. 
Over the last three decades, an enormous number of new financial products have 
been created to meet customers' demands. For example, to reduce the impact of 
the fluctuations of currency exchange rates on corporate finances, a multinational 
corporation may decide to buy options on the future of exchange rates; to reduce 
the risk of price fluctuations of a commodity (e.g. lumbers, corns, soybeans), a 
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farmer may enter into a future contract of the commodity; to reduce the risk of 
weather exposures, amusement parks and energy companies may decide to pur- 
chase financial derivatives based on the weather. Since the first options exchange 
opened in Chicago in 1973, the derivative markets have experienced extraordi- 
nary growth. Professionals in finance now routinely use sophisticated statistical 
techniques and modern computing power in portfolio management, securities reg- 
ulation, proprietary trading, financial consulting, and risk management. For an 
overview, see [2H1 and references therein. 

Complex financial markets (SOI make portfolio allocation, asset pricing and 
risk management very challenging. For example, the price of a stock depends 
not only on its past values, but also its bond and derivative prices. In addition, 
it depends on prices of related companies and their derivatives, and on overall 
market conditions. Hence, the number of variables that influence asset prices can 
be huge and the statistical challenge is to select important factors that capture 
the market risks. Thanks to technological innovations, high-frequency flnancial 
data are now available for an array of different financial instruments over a long 
time period. The amount of financial data available to financial engineers is indeed 
astronomical. 

Let us focus on a specific problem to illustrate the challenge of dimensionality. 
To optimize the performance of a portfolio [T^ or to manage the risk of a 
portfolio j69j , we need to estimate the covariance matrix of the returns of assets in 
the portfolio. Suppose that we have 200 stocks to be selected for asset allocation. 
There are 20,200 parameters in the covariance matrix. This is a high-dimensional 
statistical problem and estimating it accurately poses challenges. 

Covariance matrices pervade every facet of financial econometrics, from asset 
allocation, asset pricing, and risk management, to derivative pricing and propri- 
etary trading. As mentioned earlier, they are also critical for studying genetic 
networks as well as other statistical applications such as climatology 53^. In 
Section 6.1, a modified Cholesky decomposition is used to estimate huge covariance 
matrices using penalized least squares approach proposed in Section 2. We will 
introduce a factor model for covariance estimation in Section 6.3. 

2.4. Machine learning and data mining. Machine learning and data 
mining extend traditional statistical techniques to handle problems with much 
higher dimensionality. The size of data can also be astronomical: from grocery 
sales and financial market trading to biomedical images and natural resource sur- 
veys. For an introduction, see the books 46,i43- Variable selections and feature 
extraction are vital for such high-dimensional statistical explorations. Because of 
the size and complexity of the problems, the associated mathematical theory also 
differs from the traditional approach. The dimensionality of variables is compa- 
rable with the sample size and can even be much higher than the sample size. 
Selecting reliable predictors to minimize risks of prediction is fundamental to ma- 
chine learning and data mining. On the other hand, as the interest mainly lies in 
risk minimization, unlike traditional statistics, the model parameters are only of 
secondary interest. As a result, crude consistency results suffice for understand- 
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ing the performance of learning theory. This eases considerably the mathematical 
challenges of high-dimensionality. For example, in the supervised (classification) 
or unsupervised (clustering) learning, we do not need to know the distributions of 
estimated coefficients in the underlying model. We only need to know the variables 
and their estimated parameters in the model. This differs from high-dimensional 
statistical problems in health sciences and biomedical studies, where statistical in- 
ferences are needed in presence of high-dimensionality. In Sections 4.2 and 6.4, we 
will address further the challenges in machine learning. 



3. Penalized least squares 

With the above background, we now consider the variable selection in the least- 
squares setting to gain further insights. The idea will be extended to the likelihood 
or pseudo-likelihood setting in the next section. We demonstrate how to directly 
apply the penalized least squares approach for function estimation or approxima- 
tion using wavelets or spline basis, based on noisy data in Section 5. The penalized 
least squares method will be further extended to penalized empirical risk minimiza- 
tion for machine learning in Section 6.4. 

Let {xi. Hi}, i — 1, - ■ ■ , n, be a random sample from the linear regression model 

y = x^/3 + £, (3.1) 

where e is a random error with mean and finite variance , and /3 = (/3i , • • • , (3d)'^ 
is the vector of regression coefficients. Here, we assume that all important predic- 
tors, and their interactions or functions are already in the model so that the full 
model H3.1|) is correct. 

Many variable selection criteria or procedures are closely related to minimize 
the following penalized least squares (PLS) 

-.71 d 

-^(y.-xf/3)2 + ^p,^ (1/3,1), (3.2) 
" i=i j=i 

where d is the dimension of x, and px^ (•) is a penalty function, controlling model 
complexity. The dependence of the penalty function on j allows us to incorporate 
prior information. For instance, we may wish to keep certain important predictors 
in the model and choose not to penalize their coefficients. 

The form of p\- (•) determines the general behavior of the estimator. With the 
entropy or Lo-penalty, namely, p\j{\Pj\) — ^X^I{\(3j\ ^ 0), the PLS (|3.2|) becomes 

1 " 1 

-J2{y,-^J(3r + -X'\M\, (3.3) 

i=l 

where \M\ = J2j 0)) the size of the candidate model. Among models 

with m variables, the selected model is the one with the minimum residual sum of 
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squares (RRS), denoted by RSS,„. A classical statistical method is to choose m 
by maximizing the adjusted R^, given by 

_ n-1 RSS„t 
^adj.m - i r> -m RSSi ' 

IL III IVOkJl 

or equivalently by minimizing RSSm/(?T. — m), where RSSi is the total sum of 
squares based on the null model (using the intercept only). Using log(l + x) « a; 
for small x, it follows that 

log{RSS„,/(n - m)} w (loga^ _ i) + cr^^/lRSS™ + -ma^}. (3.4) 

n n 

Therefore, maximization of -Radj,m is asymptotically equivalent to minimizing the 
PLS with A = a/y/n. Similarly, generalized cross-validation (GCV) given by 

GCV(m) = RSS„/{n(l - m/n)^} 

is asymptotically equivalent to the PLS with A — \/2a/^/n and so is the 

cross-validation (CV) criterion. 

Many popular variable selection criteria can be shown asymptotically equiv- 
alent to the PLS (|3.3|l with appropriate values of A, though these criteria were 
motivated from different principles. See |68j and references therein. For instance, 
RIC TT corresponds to A = y^2\og{d){a-/ y/n). Since the entropy penalty function 
is discontinuous, minimizing the entropy-penalized least-squares requires exhaus- 
tive search, which is not feasible for high-dimensional problem. In addition, the 
sampling distributions of resulting estimates are hard to derive. 

Many researchers have been working on minimizing the PLS (|3.2|l with Lp- 
penalty for some p > 0. It is well known that the L2-penalty results in a ridge 
regression estimator, which regularizes and stabilizes the estimator but introduces 
biases. However, it does not shrink any coefficients directly to zero. 

The Lp-penalty with < p < 2 yields bridge regression intermediating the 
best-subset (Lo-penalty) and the ridge regression (L2-penalty). The non-negative 
garrote |H] shares the same spirit as that of bridge regression. With the ii-penalty 
specifically, the PLS estimator is called LASSO in [HOI. In a seminal paper, Donoho 
and Elad j2()l show that penalized Lo-solution can be found by using penalized Li- 
method for sparse problem. When p < I, the PLS automatically performs variable 
selection by removing predictors with very small estimated coefficients. 

Antoniadis and Fan pP discussed how to choose a penalty function for wavelets 
regression. Fan and Li |32| advocated penalty functions with three properties: 

a. Sparsity: The resulting estimator should automatically set small estimated 

coefficients to zero to accomplish variable selection. 

b. Unbiasedness: The resulting estimator should have low bias, especially when 

the true coefficient /3j is large. 

c. Continuity: The resulting estimator should be continuous to reduce instability 

in model prediction. 
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To gain further insights, let us assume that the design matrix X = (xi, • • • ,x„)''" 
for model (|3.1|l is orthogonal and satisfies that iX"^X — Id- Let z = (X"'"X)^^X"'"y 
be the least squares estimate of /3. Then 1)3. 2|l becomes 

i-||y-Xz|| + i|lz-/3f + ^p,^(|/3,|). 

Thus, the PLS reduces to a componentwise minimization problem: 

min{^(^j - Po? + Px, I)}, for J = 1, • • • , d, 

where Zj is the j-th component of z. Suppress the subscript j and let 

Q(/3) = i(z~/3)2+p,(|/3|). (3.5) 

Then the first order derivative of Q{I3) is given by 

Q'(/3) = /3- z + pl(|/3|)sgn(/3) = sgn(/3){|/3| +Mm " ^- 

Antoniadis and Fan 1 and Fan and Li [35] derived that the PLS estimator pos- 
sesses the following property: 

(a) sparsity if min;3{|/3| + p'^{\(3\)} > 0; 

(b) unbiasedness = for large \f3\; 

(c) continuity if and only if argmin^{|/3| + p';)j(|/3|)} = 0. 

The ip-penalty with < p < 1 does not satisfy the continuity condition, the Li 
penalty does not satisfy the unbiasedness condition, and Lp with p > 1 does not 
satisfy the sparsity condition. Therefore, none of the Lp-penalties satisfies the 
above three conditions simultaneously, and Li-penalty is the such penalty that 
is both convex and produces sparse solutions. Of course, the class of penalty 
functions satisfying the aforementioned three conditions arc infinitely many. Fan 
and Li [32] suggested the use of the smoothly clipped absolute deviation (SCAD) 
penalty defined as 

r A|/3|, ifO<|/3|<A; 
Pxm) = { -(|/3p-2aA|/3|+A2)/{2(a-l)}, if A < < aA; 
[ (a + l)AV2, if |/3| >aX. 

They further suggested using a = 3.7. This function has similar feature to the 
penalty function A|/3|/(l + 1/3|) advocated in [Slj- FigureHdepicts the SCAD, L0.5- 
penalty, Li-penalty, and hard thresholding penalty (to be introduced) functions. 
These four penalty functions are singular at the origin, a necessary condition for 
sparsity in variable selection. Furthermore, the SCAD, hard-thresholding and io.5 
penalties are nonconvex over (0, -l-oo) in order to reduce the estimation bias. 

Minimizing the PLS ()3.5|l with the entropy penalty or hard-thresholding penalty 
P\{P) = A^ — (A — \ j3\)\ (which is smoother) yields the hard-thresholding rule [23 
(3h — zI{\z\ > A). With the Li-penalty, the PLS estimator is /35 = sgn(z)(|z|— A)+, 
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the soft-thresholding rule O ■ The L2-penalty results in the ridge regression 
/^i? = (1 + A)~^z and the SCAD penalty gives the solution 

( sgn(z)(|z| - A)+, when |z| < 2A; 

PscAB = I {(a — l)z — sgn(2;)aA}/(a — 2), when 2A < |2;| < aA; 
[ z, when \z\ > aX. 

These functions are also shown in Figure ^ The SCAD is an improvement over 
the Lp-penalty in two aspects: saving computational cost and resulting in a con- 
tinuous solution to avoid unnecessary modeling variation. Furthermore, the SCAD 
improves bridge regression by reducing modeling variation in model prediction. Al- 
though similar in spirit to the Li-penalty, the SCAD also improves the Li-penalty 
by avoiding excessive estimation bias since the solution of the Li-penalty could 
shrink all regression coefficients by a constant, e.g., the soft thresholding rule. 



4. Penalized likelihood 

PLS can easily be extended to handle a variety of response variables, including 
binary response, counts, and continuous response. A popular family of this kind 
is called generalized linear models. Our approach can also be applied to the case 
where the likelihood is a quasi-likelihood or other discrepancy functions. This will 
be demonstrated in Section 6.2 for analysis of survival data, and in Section 6.4 for 
machine learning. 

Suppose that conditioning on x^, yi has a density f{g{x[/3),yi}, where 5 is a 
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known inverse link function. Define a penalized likelihood as 

n d 

" 1=1 j=i 

Maximizing the penalized likelihood results in a penalized likelihood estimator. 
For certain penalties, such as the SCAD, the selected model based on the noncon- 
cave penalized likelihood satisfies /3j = for certain /3j's. Therefore, parameter 
estimation is performed at the same time as the model selection. 

Example 1. (Logistics Regression) Suppose that given x;, yi follows a Bernoulli 
distribution with success probabihty = l|xi} =p{xi). Take5(u) — exp(w)/(l+ 
exp(u)), i.e. p(x) = exp(x^/3)/{l + exp(x-^/3)}. Then H4.1|l becomes 

n d 

- ^[y,(xf /3) - log{l + exp(xf/3)}] - J^pa, 

1=1 j=i 

Thus, variable selection for logistics regression can be achieved by maximizing the 
above penalized likelihood. 

Example 2. (Poisson Log-linear Regression) Suppose that given x^, yi follows a 
Poisson distribution with mean A(xi). Take g{-) to be the log-link, i.e. A(x) = 
exp(x^/3). Then H4.1|l can be written as 

^ f]{y.(xf/3) - exp(xf/3)} - ^pa,(|/?,|) 
" 1=1 j=i 

after dropping a constant. Thus, maximizing the above penalized likelihood with 
certain penalty functions yields a sparse solution for (3. 

4.1. Oracle properties. Maximizing a penalized likelihood selects variables 
and estimates parameters simultaneously. This allows us to establish the sampling 
properties of the resulting estimators. Under certain regularity conditions. Fan 
and Li ^! demonstrated how the rates of convergence for the penalized likelihood 
estimators depend on the regularization parameter A„ and established the oracle 
properties of the penalized likelihood estimators. 

In the context of variable selection for high-dimensional modeling, it is natural 
to allow the number of introduced variables to grow with the sample sizes. Fan 
and Peng [SJ have studied the asymptotic properties of the penalized likelihood 
estimator for situations in which the number of parameters, denoted by c?„, tends 
to oo as the sample size n increases. Denote /3„g to be the true value of f3. To 
emphasize the dependence of Xj on n, we use notation A„ j for Xj in this subsection. 
Define 



a„ = max{p';)^ . (|^„oj |) : /3nOj 7^ 0} and 6„ = max{|p';( . (l^nOj |)| : ^nOj 7^ 0}. 

(4.2) 
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Fan and Peng [35 showed that if both a„ and &„ tend to as n ^ oo, then 
under certain regularity conditions, there exists a local maximizer (3 of Q{(3) such 
that 

!|3 - /3„oll = Op{^,{n-^/^ + a„)}. (4.3) 

It is clear from (|4.3I) that by choosing a proper A„j such that a„ = 0(n~^/^), 
there exists a root-(n/(i„) consistent penalized likelihood estimator. For example, 
for the SCAD, the penalized likelihood estimator is root-(n/dn) consistent if all 
A„j 's tend to 0. 

Without loss of generality assume that, unknown to us, the first s„ components 
of /3„oi denoted by /3„oi' ^''^ nonzero and do not vanish and the remaining c?„ — s„ 
coefficients, denoted by /3„o2; ^^'^ 0- Denote by 

£ = diag{p';(„_^(|/?„oi|), • • • ,Pa„,,„ (l/3„osJ)} 

and ^ 
b= {p'\„.ii\PnOi\)sgn-{PnOi),--- ,PA„,,„(l^»06j)sgn(/3„os„)) . 

Theorem 1. Assume that as n oo, mini<j<s^ \PnOj\l ^n,j ~^ oo and that the 
penalty function px. {\(3j\) satisfies 

liminf liminfp', . (/3,)/A„ , > 0. (4.4) 

If ^n,j ^ 0, y^n/dnXnj ^ OO and d^/n as n ~* oo, then with probability 

tending to 1, the root n/dn consistent local maximizers (3 = (/3„i, /3„2)"^ must 
satisfy: 

(i) ("Sparsity; 3„2 = 0/ 

(a) (Asymptotic normality^ for any qx Sn matrix A„ such that A„A^ G, 
a q X q positive definite symmetric matrix, 

VHA„I-^/'{Ii + S} [Xi - /3„io + (Ii + S)-ib} ^ N{0, G) 

where Ii = Ii(/3„]^q, 0), f/ie Fisher information knowing /3„2o = 0. 

The theorem implies that any finite set of elements of /3„]^ are jointly asymptot- 
ically normal. For the SCAD, if all Aj_„ — > 0, a„ = 0. Hence, when ^n/(i„A„j- — > 

00, its corresponding penalized likelihood estimators possess the oracle property, 

1. e., perform as well as the maximum likelihood estimates for estimating (3^i know- 
ing /3„2 = 0. That is, with probability approaching to 1, 

3„2-0, and V^A„iy'(3„i-/3„io)-^iV(0,G). 

For the Li-penalty, a„ = maxj Aj^„. Hence, the root-n/c?„ consistency requires 
that Xnj = 0{\J dn/n). On the other hand, the oracle property in Theorem 
2 requires that y^n/dn^nj oo. These two conditions for LASSO cannot be 
satisfied simultaneously. It has indeed been shown that the oracle property does 
not hold for the Li-penalty even in the finite parameter setting j8^) . 
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4.2. Risk minimization and persistence. In machine learning such 
as tumor classifications, the primary interest centers on the misclassification errors 
or more generally expected losses, not the accuracy of estimated parameters. This 
kind of properties is called persistence in |41l I42| . 

Consider predicting the response Y using a class of model g{x^ (3) with a loss 
function £{g{X^ f3),Y). Then, the risk is 

L„{f3)^Ee{g{X^f3),Y}, 

where n is used to stress the dependence of dimensionality c? on n. The minimum 
risk is obtained at /3* = argmin^L„(/3). In the penalized likelihood context, 

£ = — log/. Suppose that there is an estimator /3„ based on a sample of size n. 
This can be done by the penalized empirical risk minimization similarly to (|4.1|l : 

n d 

^{g(xf /3), y,} + J^Pa, (|/3, |), (4.5) 

i=l 3=1 

based on a set of training data {(x^, j/^), i = 1, • • • , n}. The persistence requires 

L„(3)-L„(/3*) ^0, (4.6) 

but not the consistency of /3 to /3* . This is in general a much weaker mathematical 
requirement. Greenshtein and Ritov [i2\ show that if the non-sparsity rate Sn = 
0{(?i/ logn)^/^} and d„ — n" for some a > 1, LASSO (penalized Li least-squares) 
is persistent under the quadratic loss. Greenshtein jlj extends the results to the 
case where s„ = 0{n/logn} and more general loss functions. Meinshausen 
considers a case with finite non-sparsity s„ but with logd„ = n^, with ^ € (0, 1). 
It is shown there that for the quadratic loss, LASSO is persistent, but the rate to 
persistency is slower than a relaxed LASSO. This again shows the bias problems 
in LASSO. 

4.3. Issues in practical implementation. In this section, we address 
practical implementation issues related to the PLS and penalized likelihood. 

Local quadratic approximation (LQA). The Lp, (0 < p < 1), and SCAD 
penalty functions are singular at the origin, and they do not have continuous second 
order derivatives. Therefore, maximizing the nonconcave penalized likelihood is 
challenging. Fan and Li |32| propose locally approximating them by a quadratic 
function as follows. Suppose that we are given an initial value /3*^ that is close to the 
optimizer of Q{I3). For example, take initial value to be the maximum likelihood 
estimate (without penalty). Under some regularity conditions, the initial value is 
a consistent estimate for /3, and therefore it is close to the true value. Thus, we 
can locally approximate the penalty function by a quadratic function as 

PA„(|/3,|) «PA„(|/3^|) + \{AM\)/\l3'i\W, - /3f ), for /3, « (4.7) 

To avoid numerical instability, we set (3j = if is very close to 0. This cor- 
responds to deleting Xj from the final model. With the aid of the LQA, the 



14 



Jianqing Fan and Runze Li 



optimization of penalized least-squares, penalized likelihood or penalized partial 
likelihood (see Section 6.2) can be carried out by using the Newton-Raphson algo- 
rithm. It is worth noting that the LQA should be updated at each step during the 
course of iteration of the algorithm. We refer to the modified Newton-Raphson 
algorithm as the LQA algorithm. 

The convergence property of the LQA algorithm was studied in [ST], whose 
authors first showed that the LQA plays the same role as the E-step in the EM 
algorithm ^HI- Therefore the behavior of the LQA algorithm is similar to the 
EM algorithm. Unlike the original EM algorithm, in which a full iteration for 
maximization is carried out after every E-step, we update the LQA at each step 
during the iteration course. This speeds up the convergence of the algorithm. The 
convergence rate of the LQA algorithm is quadratic which is the same as that of 
the modified EM algorithm [55] . 

When the algorithm converges, the estimator satisfies the condition 

dm/dp,+np'^^{%\)sgn$,)^Q, 

the penalized likelihood equation, for non-zero elements of /3. 

Standard error formula. Following conventional techniques in the likelihood 
setting, we can estimate the standard error of the resulting estimator by using the 
sandwich formula. Specifically, the corresponding sandwich formula can be used as 
an estimator for the covariance of the estimator f3i^ the non- vanishing component 
of /3. That is, 

cot(3i) = - nSA(3i)}-^c5V{V£(3i)}{V2^(3i) - nUSi)}-'. (4-8) 

where cov{V^(^]^)} is the usual empirically estimated covariance matrix and 
Sa(3i) =diagK^(|^i|)/|^i|,.- - ,p';,^J|^.J)/|^.J} 

and s„ the dimension of (3^. Fan and Peng [^ demonstrated the consistency of 
the sandwich formula: 

Theorem 2. Under the conditions of Theorem 1, we have 

A„cow(3i)A^ - A„S„A^ -^0 asn^ (X, 
for any matrix A„ such that A„A^ = G, where S„ = (Ii + S)^^Ij^"'^(Ii + Yi)^^ . 

Selection of regularization parameters. To implement the methods described 
in previous sections, it is desirable to have an automatic method for selecting 
the thresholding parameter A in p\{-) based on data. Here, we estimate A via 
minimizing an approximate generalized cross-validation (GCV) statistic in jll) . 
By some straightforward calculation, the effective number of parameters for Q{f3) 
in the last step of the Newton-Raphson algorithm iteration is 

e(A) = e(Ai, • • • , A<i) = tv[{^''m + Sa(3)}"'v2^(3)]. 
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Therefore the generalized cross-validation statistic is defined by 
GCV(A) = -e0)/[n{l - e(A)/n}2] 

and A = argmin_i^{GCV(A)} is selected. 

To find an optimal A, we need to minimize the GCV over a (i„-dimensional 
space. This is an unduly onerous task. Intuitively, it is expected that the magni- 
tude of Xj should be proportional to the standard error of the maximum likelihood 
estimate of Pj. Thus, we set A = Xse{/3yii^^) in practice, where se(/3iy[LE) denotes 
the standard error of the MLE. Therefore, we minimize the GCV score over the 
one-dimensional space, which will save a great deal of computational cost. The 
behavior of such a method has been investigated recently. 



5. Applications to function estimation 

Let us begin with one-dimcnsional function estimation. Suppose that we have 
noisy data at possibly irregular design points {xi, • • • , Xn}- 

Hi = m{xi) + £i, 

where m is an unknown regression and e^'s are iid random error following -/V(0, (t^). 
Local modeling techniques I^OI have been widely used to estimate m(-). Here we 
focus on global function approximation methods. 

Wavelet transforms are a device for representing functions in a way that is local 
in both time and frequency domains (131 1141 IS5ll64| . During the last decade, they 
have received a great deal of attention in applied mathematics, image analysis, sig- 
nal compression, and many other fields of engineering. Daubechies |17j and Meyer 
inn are good introductory references to this subject. Wavelet-based methods have 
many exciting statistical properties j22] . Earlier papers on wavelets assume the 
regular design points, i.e, Xi — ^ (usually n = 2^ for some integer k) so that fast 
computation algorithms can be implemented. See and references therein. For 
an overview of wavelets in statistics, see |8f)| . 

Antoniadis and Fan jlj discussed how to apply wavelet methods for function 
estimation with irregular design points using penalized least squares. Without loss 
of generality, assume that m{x) is defined on [0, 1]. By moving nondyadic points 
to dyadic points, we assume Xi = for some rii and some fine resolution J 

that is determined by users. To make this approximation errors negligible, we take 
J large enough such that 2'^ > n. Let W be a given wavelet transform at all 
dyadic points {i/2'^ :, i = 1, • • • , 2'' — 1}. Let N = 2'' and be the ri^-th column 
of W, an X TV matrix, and /3 — Wm be the wavelet transform of the function 
m at dyadic points. Then, it is easy to see that m{xi) — SL[f3. This yields an 
overparameterized linear model 



Vi = af/3 + £i, 



(5.1) 
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which aims at reducing modehng biases. However, one cannot find a reasonable 
estimate of /3 by using the ordinary least squares method since N > n. Directly 
applying penalized least squares, we have 

-. n N 

-J2{y,-af(3f + Y^p,^{\P^\). (5.2) 

i=i j=i 

If the sampling points are equally spaced and n = 2"^ , the corresponding design 
matrix of linear model H5.ll) becomes a square orthogonal matrix. From the discus- 
sion in Section 3, minimizing the PLS H5.2|l with the entropy penalty or the hard- 
thresholding penalty results in a hard-thresholding rule. With the Li penalty, the 
PLS estimator is the soft-thresholding rule. Assume that pxi') is nonnegative, non- 
decreasing, and differentiable over (0, cx)) and that function —(3~p'.^{(3) is strictly 
unimodal on (0,oo), p')^{-) is nonincreasing andp^(O-l-) > 0. Then, Antoniadis and 
Fan [T| showed that the resulting penalized least-squares estimator that minimizes 
(|5.2|l is adaptively minimax within a factor of logarithmic order as follows. Define 
the Besov space ball ^(C) to be 

B;JC) = {meL,: J2i2^^^+'/'-'/P^\\e,.\\,)'' < C}, 

3 

where 6j. is the vector of wavelet coefficients of function m at the resolution level 
j. Here r indicates the degree of smoothness of the regression functions m. 

Theorem 3. Suppose that the regression function rn(-) is in a Besov ball with r-f 
1/2 — 1/p > 0. Then, the maximum risk oj the PLS estimator fh{-) over B^ g{C) is 
of rate 0{n'~^^^^^^~^^"> log(n)) when the universal thresholding ■y/21og(n)/n is used. 
It also achieves the rate of convergence 0{n~^'"/*-^'""*"^Mog(n)} when the minimax 
thresholding pn / y/n is used, where Pn is given in 0'. 

We next consider multivariate regression function estimation. Suppose that 
{xi,?/i} is a random sample from the regression model 

y = m(x) -I- e, 

where, without loss of generality, it is assumed that x G [0, 1]''. Radial basis and 
neural-network are also popular for approximating multi-dimensional functions. In 
the literature of spline smoothing, it is typically assumed that the mean function 
m(x) has a low-dimensional structure. For example, 

m(x) = /io + ^mj{xj) + ^mki{xk,xi). 
j k<i 

For given knots, a set of spline basis functions can be constructed. The two most 
popular spline bases are the truncated power spline basis 1, x, x^, a;'^, (x — tj)^, (j = 
1, • • • , J), where ij's are knots, and the B-spline basis (see [H] for definition). The 
i?-spline basis is numerically more stable since the multiple correlation among the 
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basis functions is smaller, but the power truncated spline basis has the advantage 
that deleting a basis function is the same as deleting a knot. 

For a given set of 1-dimensional spline bases, we can further construct a mul- 
tivariate spline basis using tensor products. Let {Bi, ■ ■ ■ ,Bj} be a set of spline 
basis functions on [0,1]''. Approximate the regression function to(x) by a linear 
combination of the basis functions, f3jBj(Tc), say. To avoid a large approxima- 
tion bias, we take a large J. This yields an overparameterized linear model, and 
the fitted curve of the least squares estimate is typically undersmooth. Smoothing 
spline suggested penalizing the roughness of the resulting estimate. This is equiv- 
alent to the penalized least squares with a quadratic penalty. In a series of work 
by Stone and his collaborators (see [7S]), they advocate using regression sphnes 
and modifying traditional variable selection approaches to select useful spline sub- 
bases. Ruppert et al [Tl] advocated penalized sphnes in statistical modehng, in 
which power truncated splines are used with the L2 penalty. Another kind of 
penalized splines method proposed by [23 shares the same spirit of |7^ . 



6. Some solutions to the challenges 

In this section, we provide some solutions to problems raised in Section 2. 

6.1. Computational biology. As discussed in Section 2.1, the first statis- 
tical challenge in computational biology is how to remove systematic biases due 
to experiment variations. Thus, let us first discuss the issue of normalization of 
cDNA-microarrays. Let Yg be the log-ratio of the intensity of gene g of the treat- 
ment sample over that of the control sample. Denote by Xg the average of the 
log-intensities of gene g at the treatment and control samples. Set rg and Cg be 
the row and column of the block where the cDNA of gene g resides. Fan et al [311 
use the following model to estimate the intensity and block effect: 

Yg^ag + l3r^+lc,+f{Xg)+eg, 5 = 1, •••,7V (6.1) 

where ag is the treatment effect on gene /J^g and are block effects that are 
decomposed into the column and row effect, f{Xg) represents the intensity effect 
and N is the total number of genes. Based on J arrays, an aim of microarray 
data analysis is to find genes g with ag statistically significantly different from 
0. However, before carrying multiple array comparisons, the block and treatment 
effects should first be estimated and removed. For this normalization purpose, 
parameters ag are nuisance and high-dimensional (recall N is in the order of tens 
of thousands). On the other hand, the number of significantly expressed genes is 
relatively small, yielding the sparsity structure of ag . 

Model (|6.1|) is not identifiable. Fan et al |^ use within-array replicates to 
infer about the block and treatment effects. Suppose that we have / replications 
for G genes, which could be a small fraction of N. For example, in 3^, only 
111 genes were repeated at random blocks (G — 111,/ = 2), whereas in all 
genes were repeated three times, i.e. / = 3 and N — 3G, though both have about 



18 



Jianqing Fan and Runze Li 



N « 20, 000 genes printed on an array. Using / replicated data on G genes, model 
(|6.1|l becomes 

Yg^=ag+(3r,.+lc,,+.fiXg^)+eg^, <? = 1 G; I = 1 ,•••,/ . (6.2) 

With estimated coefficients (3 and 7 and the function /, model Ht).l|l implies that 
the normalized data are Y* —Yg — (3rg — Icg — fi^g) even for non-repeated genes. 

Model 16.2|1 can be used to remove the intensity effect array by array, though 
the number of nuisance parameters is very large, a fraction of total sample size 
in Ht).2|) . To improve the efficiency of estimation. Fan et al aggregate the 
information from other microarrays (total J arrays): 

Yg^j = "g + + 7c,, J + f]{Xg^J) + Eg,, j = 1, • • • , J, (6.3) 

where the subscript j denotes the array effect. 

The parameters in (|6.2|) can be estimated by the profile least-squares method 
using the Gauss-Seidel type of algorithm. See [33] for details. To state the results, 
let us write model (|6.2|l as 

Yg,^ag + 7Fg^(3 + f{Xg,)+eg,, (6.4) 

by appropriately introducing the dummy variable Z. Fan et al 35' obtained the 
following results. 

Theorem 4. Under some regularity conditions, as n ~ IG 00, the profile 
least-squares estimator of model \(i.4\j has 

V^(3 -fi)^N [i\ -L_a^^-^^ , 

where S = E{Var{Z\X)} and cr'^ = Var{e). In addition, J{x)-f{x) = Op{n-'^/^). 

Theorem 5. Under some regularity conditions, as n = IG — > 00, when X and Z 
are independent, the profile least-squares estimator based on ]6.3\) possesses 

The above theorems show that the block effect can be estimated at rate Op{n^^/^) 
and intensity effect / can be estimated at rate Op(n~^/^). This rate can be im- 
proved to Op{n^^/^ + N~^^^) when data in (|6.1|l are all used. The techniques have 
also been adapted for the normalization of Affymetrix arrays 129| . Once the arrays 
have been normalized, the problem becomes selecting significantly expressed genes 
using the normalized data 

Ygj^ag+Egj, Q = 1 , ' ' ' , N ^ 1 , ■ ■ ■ , J , (6.5) 

where Y*j is the normalized expression of gene g in array j. This is again a high- 
dimensional statistical inference problem. The issues of computing P-values and 
false discovery are given in Section 2.1. 
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Estimation of high-dimensional covariance matrices is critical in studying ge- 
netic networks. PLS and penalized likelihood can be used to estimate large scale co- 
variance matrices effectively and parsimoniously 0S1ISHI- Let w = {Wi, • • • , Wd)^ 
be a d-dimensional random vector with mean zero and covariance E. Using the 
modified Cholesky decomposition, we have LSL"^ = D, where L is a lower trian- 
gular matrix having ones on its diagonal and typical element —(t>tj in the (t, j)th 
position for 1 < j < t < c?, and D = diag{cr^, • • • ,cr^)"^ is a diagonal matrix. 
Denote e = Lw ^ {ei, - ■ ■ , Cd)'^ ■ Since D is diagonal, ei, • • • , are uncorrelated. 
Thus, for 2 < i < d 

t-1 

W^t =^(/.i,M(,+et. (6.6) 

That is, the Wt is an autoregressive (AR) series, which gives an interpretation 
for elements of L and D, and allows us to use PLS for covariance selection. We 
first estimate af using the mean squared errors of model 1)6.6(1 . Suppose that w^, 
i = 1, • • • , n, is a random sample from w. For t ^ 2, ■ ■ ■ , d, covariance selection 
can be achieved by minimizing the following PLS functions: 

ji t-1 t-1 

i=l j=l j=l 

This reduces the non-sparse elements in the lower triangle matrix L. With esti- 
mated L, the diagonal elements can be estimated by the sample variance of the 
components in Lwi. The approach can easily be adapted to estimate the sparse 
precision matrix See |66) for a similar approach and a thorough study. 

6.2. Health studies. Survival data analysis has been a very active research 
topic because survival data are frequently collected from reliability analysis, med- 
ical studies, and credit risks. In practice, many covariates are often available as 
potential risk factors. Selecting significant variables plays a crucial role in model 
building for survival data but is challenging due to the complicated data struc- 
ture. Fan and Li [23] derived the nonconcave penalized partial likelihood for Cox's 
model and Cox's frailty model, the most commonly used semiparametric models in 
survival analysis. Cai, et al [5] proposed a penalized pseudo partial likelihood for 
marginal Cox's model with multivariate survival data and applied the proposed 
methodology for a subset data in the Framingham study, introduced in Section 
2.2. 

Let T, C and x be respectively the survival time, the censoring time and their 
associated covariates. Correspondingly, let Z = min{T, C} be the observed time 
and 5 = I{T < C) be the censoring indicator. It is assumed that T and C are 
conditionally independent given x, that the censoring mechanism is noninforma- 
tive, and that the observed data {(x^, Zi,Si) : i = 1, • • • ,n} is an independently 
and identically distributed random sample from a certain population (x, Z, S) . The 
Cox model assumes the conditional hazard function of T given x 



/i(t|x) = /io(t)exp(x^/3), 



(6.8) 
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where ho{t) is an unspecified baseline hazard function. Let t° < • • • < denote 
the ordered observed failure times. Let (j) provide the label for the item failing at 
i° so that the covariates associated with the N failures are ^(i), ■ ■ ■ ,^(n)- Let Rj 
denote the risk set right before the time i° : Rj ^ {i : Zi > t'j}. Fan and Li 
proposed the penalized partial likelihood 

N d 

Q{(3) = ^[x^.)/3-log{^ exp(xf/3)}] -n^p,^ (1/3,1). (6.9) 

The penalized likelihood estimate of (3 is to maximize (|6.9|l with respect to (3. 

For finite parameter settings, Fan and Li showed that under certain reg- 
ularity conditions, if both a„ and 5„ tend to 0, then there exists a local maxi- 
mizer /3 of the penalized partial likelihood function in l|6.9|) such that ||/3 — /3g|| = 
). They further demonstrated the following oracle property. 

Theorem 6. Assume that the penalty function p\^{\j3\) satisfies condition {4-4}) - 
If ^n,j ~> 0, \/n\n,j — !■ oo and a„ — 0(n^^/^), then under some mild regularity 
conditions, with probability tending to 1, the root n consistent local maximizer 

/3 — {(3i ,/32 )^ of Q{(3) defined in Iff. .9)) must satisfy 

^2 = 0, and V^(/i + S){3i-/3io + (/i + S)-ib} ^iV{0,/i(/3io)}, 

where Ii is the first s x s submatrix of the Fisher information matrix I{I3q) of the 
partial likelihood. 

Cai, et al Q investigated the sampling properties of penalized partial likeli- 
hood estimate with a diverging number of predictors and clustered survival data. 
They showed that the oracle property is still valid for penalized partial likelihood 
estimation for the Cox marginal models with multivariate survival data. 

6.3. Financial engineering and risk management. There are many 
outstanding challenges of dimensionality in diverse fields of financial engineering 
and risk management. To be concise, we focus only on the issue of covariance 
matrix estimation using a factor model. 

Let Yi be the excess return of the i-th asset over the risk-free asset. Let 
/ii ■ ■ ■ : fK be the factors that infiucnce the returns of the market. For example, 
in the Fama- French 3-factor model, /i, /2 and /a are respectively the excessive 
returns of the market porfolio, which is the value-weighted return on all NYSE, 
AMEX and NASDAQ stocks over the one-month Treasury bill rate, a portfolio 
constructed based on the market capitalization, and a portfolio constructed based 
on the book-to-market ratio. Of course, constructing factors that infiuence the 
market itself is a high-dimensional model selection problem with massive amount 
of trading data. The ii'-factor model |^ |73j assumes 



Yi = biifi H h biKfK + ? = 1, • • • , d. 



(6.10) 
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where {si} are idiosyncratic noises, uncorrelated with the factors, and d is the 
number of assets under consideration. This an extension of the famous Capital 
Asset Pricing Model derived by Sharpe and Lintner (See |1(J[I12| '). Putting it into 
the matrix form, wc have y = Bf + e so that 

S = Var(Bf) + Var(£) = BVar(f)B^ + So, (6.11) 

where S = Var(y) and So = Var(e) is assumed to be diagonal. 

Suppose that we have observed the returns of d stocks over n periods (e.g., 3 
years daily data). Then, applying the least-squares estimate separately to each 
stock in 1)6.10(1 . we obtain the estimates of coefficients in B and Sq. Now, esti- 
mating Var(f) by its sample variance, we obtain a substitution estimator S using 
(|6.11|l . On the other hand, we can also use the sample covariance matrix, denoted 
by Ssam, as an estimator. 

In the risk management or portfolio allocation, the number of stocks d can be 
comparable with the sample size n so it is better modeled as dn ■ Fan et al in- 
vestigated thoroughly when the estimate S outperforms Sgam via both asymptotic 
and simulation studies. Let us quote some of their results. 

Theorem 7. Let Afe(S) be the k-th largest eigenvalue of H. Then, under some 
regularity conditions, we have 



max 

l<fe<d„ 



Afc(S)-Afc(S) =op{(lognd2/„)i/2} ^ max Afc(S3am) - Afe(S) 

l<k<dn 



For a selected portfolio weight with 1"^^^ = 1, we have 

|^S^„ - ^^S^„| = Op{(logn 4/n)l/2} = |c^S,aml„ - C^nir 

If, in addition, the all elements in ^„ are positive, then the latter rate can be 
replaced by op{(logn d^/n)^/^}. 

The above result shows that for risk management where the portfolio risk is 
^^S^„, no substantial gain can be realized by using the factor model. Indeed, 
there is no substantial gain for estimating the covariance matrix even if the factor 
model is correct. These have also convincingly been demonstrated in |31| using 
simulation studies. Fan et al ,31| also gives the order dn under which the covariance 
matrix can be consistently estimated. 

The substantial gain can be realized if S^^ is estimated. Hence, the factor 
model can be used to improve the construction of the optimal mean-variance port- 
folio, which involves the inverse of the covariance matrix. Let us quote one theorem 
of PJj- See other results therein for optimal portfolio allocation. 

Theorem 8. Under some regularity conditions, if dn = n" , then for < a < 2, 

d~Hr(S-i/2SS-i/2 - I^y = Op(n~2/3) 

with P = min(l/2, 1 - a/2), whereas for a < 1, (i,7Hr(S-i/2Ssa™S-i/2 _ j2 ^ 
Op{dn/n). In addition, under the Frohenius norm. 



- S-i|p = 0(4 logn/n) = ||S,-i„ - ^-^\\\ 
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6.4. Machine learning and data mining. In machine learning, our 
goal is to build a model with the capability of good prediction of future observa- 
tions. Prediction error depends on the loss function, which is also referred to as a 
divergence measure. Many loss functions are used in the literature. To address the 
versatility of loss functions, let us use the device introduced by For a concave 
function q{-), define a g-class of loss function £{■, •) to be 

£{y, fh) = q{fh) - q{y) - q'{m){m - y) (6.12) 

where rh = m(x), an estimate of the regression function to(x) = E{y\x). Due to 
the concavity of q, £(•, •) is non-negative. 

Here are some notable examples of ^-loss constructed from the g-function. For 
binary classification, y E {—1,1}. Letting q{m) = 0.5min{l — to, 1 + m} yields 
the misclassification loss, £i{y,m) = I{y ^ J(to > 0)}. Furthermore, £2(2/,™) = 
[1 — ysgn(TO)]_|_ is the hinge loss if q{rn) = ^minjl — to, 1 -I- to}. The function 
93(w) = Vl — w? results in £3{y,m) = exp{— 0.5?/log{(l + to)/(1 — fh)}, the 
exponential loss function in AdaBoost Taking q{m) = cm — for some 

constant c results in the quadratic loss €4(2/, fh) = {y — fh)'^. 

For a given loss function, we may extend the PLS to a penalized empirical risk 
minimization (|4.5|l . The dimensionality d of the feature vectors can be much larger 
than n and hence the penalty is needed to select important feature vectors. See, 
for example, 0] for an important study in this direction. 

We next make a connection between the penalized loss function and the popu- 
larly used support vector machines (SVMs), which have been successfully applied 
to various classification problems. In binary classification problems, the response y 
takes values either 1 or —1, the class labels. A classification rule (5(x) is a mapping 
from the feature vector x to {1, —1}. Under the 0-1 loss, the misclassification error 
of S is P{y ^ ^(x)}. The smallest classification error is the Bayes error achieved 
by argmin^g|;^ _jjP(y = c|x). Let {x^, y^}, i = 1, • • • , n be a set of training data, 
where x^ is a vector with d features, and the output yi g {1,-1} denotes the class 
label. The 2-norm SVM is to find a hyperplane x^/3, in which xn = 1 is an in- 
tercept and f3 = /J^^^)"^, that creates the biggest margin between the training 
points from class 1 and —1 |84j : 

max ^ subject to ^^(/S'^Xi) > 1 - ^j,V i, ^i>0,y^£i< B, (6.13) 

13 11/3(2)11 ^ 

where are slack variables, and _B is a pre-specified positive number that controls 
the overlap between the two classes. Due to its elegant margin interpretation and 
highly competitive performance in practice, the 2-norm SVM has become popular 
and has been applied for a number of classification problems. It is known that the 
linear SVM has an equivalent hinge loss formulation |T7| 

n d 

3 = argmin^ ^[1 - 2/,(xf/3)]+ + A^/3|. 

i=l j=2 

Lin |HJ shows that the SVM directly approximates the Bayes rule without estimat- 
ing the conditional class probability because of the unique property of the hinge 
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loss. As in the ridge regression, the L2-penalty helps control the model complexity 
to prevent over-fitting. 

Feature selection in the SVM has received increasing attention in the literature 
of machine learning. For example, the last issue of volume 3 (2002-2003) of Journal 
of Machine Learning Research is a special issue on feature selection and extraction 
for SVMs. We may consider a general penalized SVM 

n d 

3 = argmin^^[l - y.(xf/3)]+ + J^pa, 

The 1-norm (or LASSO-hke) SVM has been used to accomphsh the goal of au- 
tomatic feature selection in the SVM (|HH1)- Friedman et al 001 shows that the 
1-norm SVM is preferred if the underlying true model is sparse, while the 2-norm 
SVM performs better if most of the predictors contribute to the response. With 
the SCAD penalty, the penalized SVM may improve the bias properties of the 
1-norm SVM. 
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